正文
36 - Introduction to Pandas - Data reading and handling
- 绘制直方图
import pandas as pd
df = pd.read_csv('images/grains/grain_measurements.csv')
df['Area'].plot(kind='hist', title='Area', bins=50)<AxesSubplot:title={'center':'Area'}, ylabel='Frequency'>
- 创建 DataFrame, 修改 index 和 columns
data = [[10, 200, 60],
[12, 155, 45],
[9, 50, -45.],
[16, 240, 90]]
df = pd.DataFrame(data, index=[1, 2, 3, 4], columns=['Area', 'Intensity', 'Orientation'])
df| Area | Intensity | Orientation | |
|---|---|---|---|
| 1 | 10 | 200 | 60.0 |
| 2 | 12 | 155 | 45.0 |
| 3 | 9 | 50 | -45.0 |
| 4 | 16 | 240 | 90.0 |
- 查看各行缺失信息
import pandas as pd
df = pd.read_csv('data/manual_vs_auto.csv')
df.info()<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Unnamed: 0 100 non-null object
1 Image 100 non-null object
2 Manual 94 non-null float64
3 Manual2 3 non-null float64
4 Auto_th_2 100 non-null int64
5 Auto_th_3 100 non-null int64
6 Auto_th_4 100 non-null int64
dtypes: float64(2), int64(3), object(2)
memory usage: 5.6+ KB
- 查看表格行列数
df.shape(100, 7)
- 查看整个表格
df| Unnamed: 0 | Image | Manual | Manual2 | Auto_th_2 | Auto_th_3 | Auto_th_4 | |
|---|---|---|---|---|---|---|---|
| 0 | Set1 | Image1 | 92.0 | 93.0 | 70 | 87 | 82 |
| 1 | Set1 | Image2 | 87.0 | 83.0 | 60 | 85 | 83 |
| 2 | Set1 | Image3 | 104.0 | 98.0 | 74 | 99 | 94 |
| 3 | Set1 | Image4 | 99.0 | NaN | 73 | 101 | 109 |
| 4 | Set1 | Image5 | 89.0 | NaN | 59 | 90 | 67 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 95 | Set4 | Image96 | 106.0 | NaN | 75 | 112 | 98 |
| 96 | Set4 | Image97 | 80.0 | NaN | 66 | 80 | 88 |
| 97 | Set4 | Image98 | 92.0 | NaN | 73 | 93 | 95 |
| 98 | Set4 | Image99 | 116.0 | NaN | 101 | 115 | 93 |
| 99 | Set4 | Image100 | 99.0 | NaN | 77 | 106 | 102 |
100 rows × 7 columns
- 查看表格前 7 行
df.head(7)| Unnamed: 0 | Image | Manual | Manual2 | Auto_th_2 | Auto_th_3 | Auto_th_4 | |
|---|---|---|---|---|---|---|---|
| 0 | Set1 | Image1 | 92.0 | 93.0 | 70 | 87 | 82 |
| 1 | Set1 | Image2 | 87.0 | 83.0 | 60 | 85 | 83 |
| 2 | Set1 | Image3 | 104.0 | 98.0 | 74 | 99 | 94 |
| 3 | Set1 | Image4 | 99.0 | NaN | 73 | 101 | 109 |
| 4 | Set1 | Image5 | 89.0 | NaN | 59 | 90 | 67 |
| 5 | Set1 | Image6 | 115.0 | NaN | 82 | 124 | 105 |
| 6 | Set1 | Image7 | 102.0 | NaN | 68 | 103 | 93 |
- 查看表格后 7 行
df.tail(7)| Unnamed: 0 | Image | Manual | Manual2 | Auto_th_2 | Auto_th_3 | Auto_th_4 | |
|---|---|---|---|---|---|---|---|
| 93 | Set4 | Image94 | 81.0 | NaN | 65 | 90 | 70 |
| 94 | Set4 | Image95 | NaN | NaN | 104 | 122 | 88 |
| 95 | Set4 | Image96 | 106.0 | NaN | 75 | 112 | 98 |
| 96 | Set4 | Image97 | 80.0 | NaN | 66 | 80 | 88 |
| 97 | Set4 | Image98 | 92.0 | NaN | 73 | 93 | 95 |
| 98 | Set4 | Image99 | 116.0 | NaN | 101 | 115 | 93 |
| 99 | Set4 | Image100 | 99.0 | NaN | 77 | 106 | 102 |
- 将某一列改为 index
df1 = df.set_index('Image')
df1.head()| Unnamed: 0 | Manual | Manual2 | Auto_th_2 | Auto_th_3 | Auto_th_4 | |
|---|---|---|---|---|---|---|
| Image | ||||||
| Image1 | Set1 | 92.0 | 93.0 | 70 | 87 | 82 |
| Image2 | Set1 | 87.0 | 83.0 | 60 | 85 | 83 |
| Image3 | Set1 | 104.0 | 98.0 | 74 | 99 | 94 |
| Image4 | Set1 | 99.0 | NaN | 73 | 101 | 109 |
| Image5 | Set1 | 89.0 | NaN | 59 | 90 | 67 |
- 查看 columns(列)名
df1.columnsIndex(['Unnamed: 0', 'Manual', 'Manual2', 'Auto_th_2', 'Auto_th_3',
'Auto_th_4'],
dtype='object')
- 去重
df['Unnamed: 0'].unique()array(['Set1', 'Set2', 'Set3', 'Set4'], dtype=object)
- 修改列名
df1 = df.rename(columns={'Unnamed: 0': 'Image_set'})
df1.columnsIndex(['Image_set', 'Image', 'Manual', 'Manual2', 'Auto_th_2', 'Auto_th_3',
'Auto_th_4'],
dtype='object')
- 显示数据类型
df.dtypesUnnamed: 0 object
Image object
Manual float64
Manual2 float64
Auto_th_2 int64
Auto_th_3 int64
Auto_th_4 int64
dtype: object
- 显示统计数据
df.describe()| Manual | Manual2 | Auto_th_2 | Auto_th_3 | Auto_th_4 | |
|---|---|---|---|---|---|
| count | 94.000000 | 3.000000 | 100.000000 | 100.000000 | 100.000000 |
| mean | 100.021277 | 91.333333 | 76.370000 | 97.580000 | 93.210000 |
| std | 11.285140 | 7.637626 | 11.971055 | 12.327337 | 14.128769 |
| min | 80.000000 | 83.000000 | 55.000000 | 71.000000 | 63.000000 |
| 25% | 90.250000 | 88.000000 | 67.750000 | 89.500000 | 83.750000 |
| 50% | 101.000000 | 93.000000 | 74.500000 | 98.500000 | 93.000000 |
| 75% | 108.000000 | 95.500000 | 85.000000 | 106.000000 | 103.250000 |
| max | 120.000000 | 98.000000 | 109.000000 | 124.000000 | 129.000000 |
37 - Introduction to Pandas - Data Manipulation
import pandas as pd
df = pd.read_csv('data/manual_vs_auto.csv')
df.head()| Unnamed: 0 | Image | Manual | Manual2 | Auto_th_2 | Auto_th_3 | Auto_th_4 | |
|---|---|---|---|---|---|---|---|
| 0 | Set1 | Image1 | 92.0 | 93.0 | 70 | 87 | 82 |
| 1 | Set1 | Image2 | 87.0 | 83.0 | 60 | 85 | 83 |
| 2 | Set1 | Image3 | 104.0 | 98.0 | 74 | 99 | 94 |
| 3 | Set1 | Image4 | 99.0 | NaN | 73 | 101 | 109 |
| 4 | Set1 | Image5 | 89.0 | NaN | 59 | 90 | 67 |
- 删除某列
df1 = df.drop('Manual2', axis=1)
df1.head()| Unnamed: 0 | Image | Manual | Auto_th_2 | Auto_th_3 | Auto_th_4 | |
|---|---|---|---|---|---|---|
| 0 | Set1 | Image1 | 92.0 | 70 | 87 | 82 |
| 1 | Set1 | Image2 | 87.0 | 60 | 85 | 83 |
| 2 | Set1 | Image3 | 104.0 | 74 | 99 | 94 |
| 3 | Set1 | Image4 | 99.0 | 73 | 101 | 109 |
| 4 | Set1 | Image5 | 89.0 | 59 | 90 | 67 |
- 删除多列
df2 = df.drop(['Manual2', 'Auto_th_2'], axis=1)
df2.head()| Unnamed: 0 | Image | Manual | Auto_th_3 | Auto_th_4 | |
|---|---|---|---|---|---|
| 0 | Set1 | Image1 | 92.0 | 87 | 82 |
| 1 | Set1 | Image2 | 87.0 | 85 | 83 |
| 2 | Set1 | Image3 | 104.0 | 99 | 94 |
| 3 | Set1 | Image4 | 99.0 | 101 | 109 |
| 4 | Set1 | Image5 | 89.0 | 90 | 67 |
- 添加并填充一列
df['Date'] = '2019-06-24'
df.head()| Unnamed: 0 | Image | Manual | Manual2 | Auto_th_2 | Auto_th_3 | Auto_th_4 | Date | |
|---|---|---|---|---|---|---|---|---|
| 0 | Set1 | Image1 | 92.0 | 93.0 | 70 | 87 | 82 | 2019-06-24 |
| 1 | Set1 | Image2 | 87.0 | 83.0 | 60 | 85 | 83 | 2019-06-24 |
| 2 | Set1 | Image3 | 104.0 | 98.0 | 74 | 99 | 94 | 2019-06-24 |
| 3 | Set1 | Image4 | 99.0 | NaN | 73 | 101 | 109 | 2019-06-24 |
| 4 | Set1 | Image5 | 89.0 | NaN | 59 | 90 | 67 | 2019-06-24 |
df.dtypesUnnamed: 0 object
Image object
Manual float64
Manual2 float64
Auto_th_2 int64
Auto_th_3 int64
Auto_th_4 int64
Date object
dtype: object
- 将字符串转换成时间
df['Date'] = pd.to_datetime('2019-06-24')
df.head()| Unnamed: 0 | Image | Manual | Manual2 | Auto_th_2 | Auto_th_3 | Auto_th_4 | Date | |
|---|---|---|---|---|---|---|---|---|
| 0 | Set1 | Image1 | 92.0 | 93.0 | 70 | 87 | 82 | 2019-06-24 |
| 1 | Set1 | Image2 | 87.0 | 83.0 | 60 | 85 | 83 | 2019-06-24 |
| 2 | Set1 | Image3 | 104.0 | 98.0 | 74 | 99 | 94 | 2019-06-24 |
| 3 | Set1 | Image4 | 99.0 | NaN | 73 | 101 | 109 | 2019-06-24 |
| 4 | Set1 | Image5 | 89.0 | NaN | 59 | 90 | 67 | 2019-06-24 |
df.dtypesUnnamed: 0 object
Image object
Manual float64
Manual2 float64
Auto_th_2 int64
Auto_th_3 int64
Auto_th_4 int64
Date datetime64[ns]
dtype: object
- 保存成 .csv 文件
df.to_csv('data/manual_vs_auto_updated.csv')- 删除某行
df1 = df.drop(df.index[1])
df1.head()| Unnamed: 0 | Image | Manual | Manual2 | Auto_th_2 | Auto_th_3 | Auto_th_4 | Date | |
|---|---|---|---|---|---|---|---|---|
| 0 | Set1 | Image1 | 92.0 | 93.0 | 70 | 87 | 82 | 2019-06-24 |
| 2 | Set1 | Image3 | 104.0 | 98.0 | 74 | 99 | 94 | 2019-06-24 |
| 3 | Set1 | Image4 | 99.0 | NaN | 73 | 101 | 109 | 2019-06-24 |
| 4 | Set1 | Image5 | 89.0 | NaN | 59 | 90 | 67 | 2019-06-24 |
| 5 | Set1 | Image6 | 115.0 | NaN | 82 | 124 | 105 | 2019-06-24 |
- 删除前 10 行
df1 = df.iloc[10:,]
df1.head()| Unnamed: 0 | Image | Manual | Manual2 | Auto_th_2 | Auto_th_3 | Auto_th_4 | Date | |
|---|---|---|---|---|---|---|---|---|
| 10 | Set1 | Image11 | 91.0 | NaN | 61 | 87 | 77 | 2019-06-24 |
| 11 | Set1 | Image12 | 119.0 | NaN | 79 | 105 | 111 | 2019-06-24 |
| 12 | Set1 | Image13 | NaN | NaN | 65 | 90 | 84 | 2019-06-24 |
| 13 | Set1 | Image14 | 117.0 | NaN | 94 | 115 | 105 | 2019-06-24 |
| 14 | Set1 | Image15 | 91.0 | NaN | 66 | 99 | 70 | 2019-06-24 |
- 选取某些行
df1 = df[df['Unnamed: 0'] != 'Set1']
df1.head()| Unnamed: 0 | Image | Manual | Manual2 | Auto_th_2 | Auto_th_3 | Auto_th_4 | Date | |
|---|---|---|---|---|---|---|---|---|
| 25 | Set2 | Image26 | 102.0 | NaN | 85 | 103 | 105 | 2019-06-24 |
| 26 | Set2 | Image27 | 93.0 | NaN | 76 | 84 | 98 | 2019-06-24 |
| 27 | Set2 | Image28 | 83.0 | NaN | 62 | 71 | 87 | 2019-06-24 |
| 28 | Set2 | Image29 | 110.0 | NaN | 92 | 117 | 85 | 2019-06-24 |
| 29 | Set2 | Image30 | 89.0 | NaN | 70 | 96 | 81 | 2019-06-24 |
38 - Introduction to Pandas - Data Sorting
- 排序
import pandas as pd
df = pd.read_csv('data/manual_vs_auto.csv')
df2 = df.sort_values('Manual', ascending=True) # ascending: 升序- 选取某行/列
df2[['Manual', 'Auto_th_2']]| Manual | Auto_th_2 | |
|---|---|---|
| 34 | 80.0 | 58 |
| 96 | 80.0 | 66 |
| 93 | 81.0 | 65 |
| 66 | 81.0 | 65 |
| 44 | 82.0 | 67 |
| ... | ... | ... |
| 32 | NaN | 66 |
| 59 | NaN | 74 |
| 79 | NaN | 69 |
| 82 | NaN | 64 |
| 94 | NaN | 104 |
100 rows × 2 columns
df[20: 30]| Unnamed: 0 | Image | Manual | Manual2 | Auto_th_2 | Auto_th_3 | Auto_th_4 | |
|---|---|---|---|---|---|---|---|
| 20 | Set1 | Image21 | 89.0 | NaN | 65 | 94 | 86 |
| 21 | Set1 | Image22 | 88.0 | NaN | 66 | 96 | 83 |
| 22 | Set1 | Image23 | 106.0 | NaN | 71 | 112 | 105 |
| 23 | Set1 | Image24 | 107.0 | NaN | 92 | 91 | 111 |
| 24 | Set1 | Image25 | 108.0 | NaN | 93 | 113 | 115 |
| 25 | Set2 | Image26 | 102.0 | NaN | 85 | 103 | 105 |
| 26 | Set2 | Image27 | 93.0 | NaN | 76 | 84 | 98 |
| 27 | Set2 | Image28 | 83.0 | NaN | 62 | 71 | 87 |
| 28 | Set2 | Image29 | 110.0 | NaN | 92 | 117 | 85 |
| 29 | Set2 | Image30 | 89.0 | NaN | 70 | 96 | 81 |
- loc 方法是通过行、列的名称或者标签来寻找我们需要的值。
Pandas 读取某列、某行数据——loc、iloc 用法总结_子木同学的博客-CSDN 博客_pandas iloc
df.loc[20: 30, ['Manual', 'Auto_th_2']]| Manual | Auto_th_2 | |
|---|---|---|
| 20 | 89.0 | 65 |
| 21 | 88.0 | 66 |
| 22 | 106.0 | 71 |
| 23 | 107.0 | 92 |
| 24 | 108.0 | 93 |
| 25 | 102.0 | 85 |
| 26 | 93.0 | 76 |
| 27 | 83.0 | 62 |
| 28 | 110.0 | 92 |
| 29 | 89.0 | 70 |
| 30 | 115.0 | 77 |
set2_df = df[df['Unnamed: 0'] == 'Set2']
set2_df.head()| Unnamed: 0 | Image | Manual | Manual2 | Auto_th_2 | Auto_th_3 | Auto_th_4 | |
|---|---|---|---|---|---|---|---|
| 25 | Set2 | Image26 | 102.0 | NaN | 85 | 103 | 105 |
| 26 | Set2 | Image27 | 93.0 | NaN | 76 | 84 | 98 |
| 27 | Set2 | Image28 | 83.0 | NaN | 62 | 71 | 87 |
| 28 | Set2 | Image29 | 110.0 | NaN | 92 | 117 | 85 |
| 29 | Set2 | Image30 | 89.0 | NaN | 70 | 96 | 81 |
- 选取最大值
max(set2_df['Manual'])120.0
- 根据条件选取某些值
df['Manual'] > 1000 False
1 False
2 True
3 False
4 False
...
95 True
96 False
97 False
98 True
99 False
Name: Manual, Length: 100, dtype: bool
df[df['Manual'] > 100].head()| Unnamed: 0 | Image | Manual | Manual2 | Auto_th_2 | Auto_th_3 | Auto_th_4 | |
|---|---|---|---|---|---|---|---|
| 2 | Set1 | Image3 | 104.0 | 98.0 | 74 | 99 | 94 |
| 5 | Set1 | Image6 | 115.0 | NaN | 82 | 124 | 105 |
| 6 | Set1 | Image7 | 102.0 | NaN | 68 | 103 | 93 |
| 7 | Set1 | Image8 | 117.0 | NaN | 77 | 122 | 88 |
| 8 | Set1 | Image9 | 104.0 | NaN | 88 | 99 | 112 |
- 复合条件
df[(df['Manual'] > 100) & (df['Auto_th_2'] < 100)].head()| Unnamed: 0 | Image | Manual | Manual2 | Auto_th_2 | Auto_th_3 | Auto_th_4 | |
|---|---|---|---|---|---|---|---|
| 2 | Set1 | Image3 | 104.0 | 98.0 | 74 | 99 | 94 |
| 5 | Set1 | Image6 | 115.0 | NaN | 82 | 124 | 105 |
| 6 | Set1 | Image7 | 102.0 | NaN | 68 | 103 | 93 |
| 7 | Set1 | Image8 | 117.0 | NaN | 77 | 122 | 88 |
| 8 | Set1 | Image9 | 104.0 | NaN | 88 | 99 | 112 |
- 遍历某些行/列
for index, row in df.iterrows():
average_auto = (row['Auto_th_2'] + row['Auto_th_3'] + row['Auto_th_4']) / 3
print(round(average_auto), row['Manual'])80 92.0
76 87.0
89 104.0
94 99.0
72 89.0
104 115.0
88 102.0
96 117.0
100 104.0
87 103.0
75 91.0
98 119.0
80 nan
...
39 - Introduction to Pandas - Grouping Data
import pandas as pd
df = pd.read_csv('data/manual_vs_auto.csv')
df = df.rename(columns = {'Unnamed: 0': 'Image_set'})
df.head()| Image_set | Image | Manual | Manual2 | Auto_th_2 | Auto_th_3 | Auto_th_4 | |
|---|---|---|---|---|---|---|---|
| 0 | Set1 | Image1 | 92.0 | 93.0 | 70 | 87 | 82 |
| 1 | Set1 | Image2 | 87.0 | 83.0 | 60 | 85 | 83 |
| 2 | Set1 | Image3 | 104.0 | 98.0 | 74 | 99 | 94 |
| 3 | Set1 | Image4 | 99.0 | NaN | 73 | 101 | 109 |
| 4 | Set1 | Image5 | 89.0 | NaN | 59 | 90 | 67 |
df = df.drop('Manual2', axis=1)
df.head()| Image_set | Image | Manual | Auto_th_2 | Auto_th_3 | Auto_th_4 | |
|---|---|---|---|---|---|---|
| 0 | Set1 | Image1 | 92.0 | 70 | 87 | 82 |
| 1 | Set1 | Image2 | 87.0 | 60 | 85 | 83 |
| 2 | Set1 | Image3 | 104.0 | 74 | 99 | 94 |
| 3 | Set1 | Image4 | 99.0 | 73 | 101 | 109 |
| 4 | Set1 | Image5 | 89.0 | 59 | 90 | 67 |
- 以 Image_set 为分组做统计
group_by_file = df.groupby(by=['Image_set'])
set_data_count = group_by_file.count()
set_data_avg = group_by_file.mean()set_data_count| Image | Manual | Auto_th_2 | Auto_th_3 | Auto_th_4 | |
|---|---|---|---|---|---|
| Image_set | |||||
| Set1 | 25 | 24 | 25 | 25 | 25 |
| Set2 | 25 | 24 | 25 | 25 | 25 |
| Set3 | 25 | 24 | 25 | 25 | 25 |
| Set4 | 25 | 22 | 25 | 25 | 25 |
set_data_avg| Manual | Auto_th_2 | Auto_th_3 | Auto_th_4 | |
|---|---|---|---|---|
| Image_set | ||||
| Set1 | 100.666667 | 72.84 | 98.04 | 92.36 |
| Set2 | 98.666667 | 75.40 | 98.00 | 93.44 |
| Set3 | 100.000000 | 78.48 | 95.52 | 94.40 |
| Set4 | 100.818182 | 78.76 | 98.76 | 92.64 |
- 统计关联性(手动与自动阈值)
df['Manual'].corr(df['Auto_th_2'])0.7381233054217538
40 - Introduction to Pandas - Dealing with missing -null- data
import pandas as pd
df = pd.read_csv('data/manual_vs_auto.csv')
df.head(8)| Unnamed: 0 | Image | Manual | Manual2 | Auto_th_2 | Auto_th_3 | Auto_th_4 | |
|---|---|---|---|---|---|---|---|
| 0 | Set1 | Image1 | 92.0 | 93.0 | 70 | 87 | 82 |
| 1 | Set1 | Image2 | 87.0 | 83.0 | 60 | 85 | 83 |
| 2 | Set1 | Image3 | 104.0 | 98.0 | 74 | 99 | 94 |
| 3 | Set1 | Image4 | 99.0 | NaN | 73 | 101 | 109 |
| 4 | Set1 | Image5 | 89.0 | NaN | 59 | 90 | 67 |
| 5 | Set1 | Image6 | 115.0 | NaN | 82 | 124 | 105 |
| 6 | Set1 | Image7 | 102.0 | NaN | 68 | 103 | 93 |
| 7 | Set1 | Image8 | 117.0 | NaN | 77 | 122 | 88 |
- 查看缺失值
df.isnull()| Unnamed: 0 | Image | Manual | Manual2 | Auto_th_2 | Auto_th_3 | Auto_th_4 | |
|---|---|---|---|---|---|---|---|
| 0 | False | False | False | False | False | False | False |
| 1 | False | False | False | False | False | False | False |
| 2 | False | False | False | False | False | False | False |
| 3 | False | False | False | True | False | False | False |
| 4 | False | False | False | True | False | False | False |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 95 | False | False | False | True | False | False | False |
| 96 | False | False | False | True | False | False | False |
| 97 | False | False | False | True | False | False | False |
| 98 | False | False | False | True | False | False | False |
| 99 | False | False | False | True | False | False | False |
100 rows × 7 columns
df.isnull().sum()Unnamed: 0 0
Image 0
Manual 6
Manual2 97
Auto_th_2 0
Auto_th_3 0
Auto_th_4 0
dtype: int64
- 删除缺失值
df = df.drop('Manual2', axis=1)
df2 = df.dropna()
df2.head(10)| Unnamed: 0 | Image | Manual | Auto_th_2 | Auto_th_3 | Auto_th_4 | |
|---|---|---|---|---|---|---|
| 0 | Set1 | Image1 | 92.0 | 70 | 87 | 82 |
| 1 | Set1 | Image2 | 87.0 | 60 | 85 | 83 |
| 2 | Set1 | Image3 | 104.0 | 74 | 99 | 94 |
| 3 | Set1 | Image4 | 99.0 | 73 | 101 | 109 |
| 4 | Set1 | Image5 | 89.0 | 59 | 90 | 67 |
| 5 | Set1 | Image6 | 115.0 | 82 | 124 | 105 |
| 6 | Set1 | Image7 | 102.0 | 68 | 103 | 93 |
| 7 | Set1 | Image8 | 117.0 | 77 | 122 | 88 |
| 8 | Set1 | Image9 | 104.0 | 88 | 99 | 112 |
| 9 | Set1 | Image10 | 103.0 | 69 | 98 | 94 |
df = pd.read_csv('data/manual_vs_auto.csv')
df.describe()| Manual | Manual2 | Auto_th_2 | Auto_th_3 | Auto_th_4 | |
|---|---|---|---|---|---|
| count | 94.000000 | 3.000000 | 100.000000 | 100.000000 | 100.000000 |
| mean | 100.021277 | 91.333333 | 76.370000 | 97.580000 | 93.210000 |
| std | 11.285140 | 7.637626 | 11.971055 | 12.327337 | 14.128769 |
| min | 80.000000 | 83.000000 | 55.000000 | 71.000000 | 63.000000 |
| 25% | 90.250000 | 88.000000 | 67.750000 | 89.500000 | 83.750000 |
| 50% | 101.000000 | 93.000000 | 74.500000 | 98.500000 | 93.000000 |
| 75% | 108.000000 | 95.500000 | 85.000000 | 106.000000 | 103.250000 |
| max | 120.000000 | 98.000000 | 109.000000 | 124.000000 | 129.000000 |
- 填充缺失值
df['Manual'].fillna(100, inplace=True)
df.head(10)| Unnamed: 0 | Image | Manual | Manual2 | Auto_th_2 | Auto_th_3 | Auto_th_4 | |
|---|---|---|---|---|---|---|---|
| 0 | Set1 | Image1 | 92.0 | 93.0 | 70 | 87 | 82 |
| 1 | Set1 | Image2 | 87.0 | 83.0 | 60 | 85 | 83 |
| 2 | Set1 | Image3 | 104.0 | 98.0 | 74 | 99 | 94 |
| 3 | Set1 | Image4 | 99.0 | NaN | 73 | 101 | 109 |
| 4 | Set1 | Image5 | 89.0 | NaN | 59 | 90 | 67 |
| 5 | Set1 | Image6 | 115.0 | NaN | 82 | 124 | 105 |
| 6 | Set1 | Image7 | 102.0 | NaN | 68 | 103 | 93 |
| 7 | Set1 | Image8 | 117.0 | NaN | 77 | 122 | 88 |
| 8 | Set1 | Image9 | 104.0 | NaN | 88 | 99 | 112 |
| 9 | Set1 | Image10 | 103.0 | NaN | 69 | 98 | 94 |
- 使用平均值填充缺失值
import numpy as np
df = pd.read_csv('data/manual_vs_auto.csv')
df['Manual'] = df.apply(
lambda row: (round((row['Auto_th_2'] + row['Auto_th_3'] + row['Auto_th_3']) / 3)) # 平均值
if np.isnan(row['Manual']) # 如果是缺失值的话
else row['Manual'], axis=1) # 填充在 Manual 列上df.head(10)| Unnamed: 0 | Image | Manual | Manual2 | Auto_th_2 | Auto_th_3 | Auto_th_4 | |
|---|---|---|---|---|---|---|---|
| 0 | Set1 | Image1 | 92.0 | 93.0 | 70 | 87 | 82 |
| 1 | Set1 | Image2 | 87.0 | 83.0 | 60 | 85 | 83 |
| 2 | Set1 | Image3 | 104.0 | 98.0 | 74 | 99 | 94 |
| 3 | Set1 | Image4 | 99.0 | NaN | 73 | 101 | 109 |
| 4 | Set1 | Image5 | 89.0 | NaN | 59 | 90 | 67 |
| 5 | Set1 | Image6 | 115.0 | NaN | 82 | 124 | 105 |
| 6 | Set1 | Image7 | 102.0 | NaN | 68 | 103 | 93 |
| 7 | Set1 | Image8 | 117.0 | NaN | 77 | 122 | 88 |
| 8 | Set1 | Image9 | 104.0 | NaN | 88 | 99 | 112 |
| 9 | Set1 | Image10 | 103.0 | NaN | 69 | 98 | 94 |
41 - Introduction to Pandas - Plotting
import pandas as pd
df = pd.read_csv('data/manual_vs_auto.csv')
df = df.rename(columns={'Unnamed: 0': 'Image_set'})
df.head()| Image_set | Image | Manual | Manual2 | Auto_th_2 | Auto_th_3 | Auto_th_4 | |
|---|---|---|---|---|---|---|---|
| 0 | Set1 | Image1 | 92.0 | 93.0 | 70 | 87 | 82 |
| 1 | Set1 | Image2 | 87.0 | 83.0 | 60 | 85 | 83 |
| 2 | Set1 | Image3 | 104.0 | 98.0 | 74 | 99 | 94 |
| 3 | Set1 | Image4 | 99.0 | NaN | 73 | 101 | 109 |
| 4 | Set1 | Image5 | 89.0 | NaN | 59 | 90 | 67 |
- 绘制折线图
df['Manual'].plot()<AxesSubplot:>
- 绘制直方图
# 类型 hist,分组 30,标题 Manual Count,图像大小 12 * 10
df['Manual'].plot(kind='hist', bins=30, title='Manual Count', figsize=(12, 10))<AxesSubplot:title={'center':'Manual Count'}, ylabel='Frequency'>
df['Manual'].rolling(3).mean().plot()<AxesSubplot:>
- 绘制 box 图,判断平均数与中位数的关系 https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.boxplot.html
df['Manual'].describe()count 94.000000
mean 100.021277
std 11.285140
min 80.000000
25% 90.250000
50% 101.000000
75% 108.000000
max 120.000000
Name: Manual, dtype: float64
df['Manual'].plot(kind='box', figsize=(8, 6))<AxesSubplot:>
- 散点图
df.plot(kind='scatter', x='Manual', y='Auto_th_2', title='Manual vs Auto 2')<AxesSubplot:title={'center':'Manual vs Auto 2'}, xlabel='Manual', ylabel='Auto_th_2'>
- pd.apply(): Pandas 数据处理(五) — apply() 方法介绍!
def cell_count(x):
if x <= 100.0:
return 'low'
else:
return 'high'df['cell_count_index'] = df['Manual'].apply(cell_count)
df.head()| Image_set | Image | Manual | Manual2 | Auto_th_2 | Auto_th_3 | Auto_th_4 | cell_count_index | |
|---|---|---|---|---|---|---|---|---|
| 0 | Set1 | Image1 | 92.0 | 93.0 | 70 | 87 | 82 | low |
| 1 | Set1 | Image2 | 87.0 | 83.0 | 60 | 85 | 83 | low |
| 2 | Set1 | Image3 | 104.0 | 98.0 | 74 | 99 | 94 | high |
| 3 | Set1 | Image4 | 99.0 | NaN | 73 | 101 | 109 | low |
| 4 | Set1 | Image5 | 89.0 | NaN | 59 | 90 | 67 | low |
df.to_csv('data/manual_vs_auto2.csv')df.boxplot(column='Manual', by='cell_count_index')<AxesSubplot:title={'center':'Manual'}, xlabel='cell_count_index'>
42 - Introduction to Seaborn Plotting in Python
import pandas as pd
df = pd.read_csv('data/manual_vs_auto.csv')
df['Manual'].fillna(100, inplace=True)
df = df.rename(columns={'Unnamed: 0': 'Image_Set'})import seaborn as sns
sns.distplot(df['Manual'])C:\Users\gzjzx\anaconda3\lib\site-packages\seaborn\distributions.py:2619: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms).
warnings.warn(msg, FutureWarning)
<AxesSubplot:xlabel='Manual', ylabel='Density'>
- sns.kdeplot()核密度估计图
sns.kdeplot(df['Manual'], shade=True)
sns.kdeplot(df['Auto_th_2'], shade=True)
sns.kdeplot(df['Auto_th_3'], shade=True)
sns.kdeplot(df['Auto_th_4'], shade=True)<AxesSubplot:xlabel='Manual', ylabel='Density'>
- sns.jointplot() 双变量关系图
sns.jointplot(x='Manual', y='Auto_th_2', data=df, kind='kde')<seaborn.axisgrid.JointGrid at 0x212f9ad23d0>
- sns.pairplot() 用来展示两两特征之间的关系
sns.pairplot(df, x_vars=['Auto_th_2', 'Auto_th_3', 'Auto_th_4'], y_vars='Manual', height=6)<seaborn.axisgrid.PairGrid at 0x212f9bd0fd0>
- sns.lmplot() 展示线性关系
sns.lmplot(x='Manual', y='Auto_th_2', data=df, order=1, hue='Image_Set')<seaborn.axisgrid.FacetGrid at 0x212fa457f70>
from scipy import stats
slope, intercept, r_value, p_value, std_err = stats.linregress(df['Manual'], df['Auto_th_2'])
slope, intercept, r_value, p_value, std_err(0.772483189743971,
-0.8937686381919718,
0.7058094587729904,
2.396963973676236e-16,
0.07831918096230937)
- sns.swarmplot() 分簇散点图
df = pd.read_csv('data/manual_vs_auto2.csv')
df['Manual'].fillna(100, inplace=True)
df = df.rename(columns={'Unnamed: 0': 'Image_Set'})
sns.swarmplot(x='Image_Set', y='Manual', data=df, hue='cell_count_index', dodge=True)<AxesSubplot:xlabel='Image_Set', ylabel='Manual'>
- sns.heatmap() 热图
corr = df.loc[:,df.dtypes == 'int64'].corr() #Correlates all int64 columns
sns.heatmap(corr, xticklabels=corr.columns, yticklabels=corr.columns, cmap=sns.diverging_palette(220, 10, as_cmap=True))<AxesSubplot:>
